Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(ir): remove the decimal precision promotion logic #8195

Merged
merged 1 commit into from
Feb 5, 2024

Conversation

chelsea-lin
Copy link
Contributor

Description of changes

This change is removing the decimal precision promotion logic from the unbounded ibis expr. The logic was introduced by #4330 so the decimal precision can matches the DB2. However, this rule is not generic for all. For example, BQ only support precision=38.

Issues closed

@cpcloud
Copy link
Member

cpcloud commented Feb 2, 2024

@webmiche Thoughts on dropping the type evolution here? It seems to be causing an issue for BigQuery which has much simpler promotion rules than DB2.

@kszucs kszucs force-pushed the the-epic-split branch 2 times, most recently from 432d151 to e4df99b Compare February 2, 2024 11:42
@kszucs kszucs changed the base branch from the-epic-split to main February 2, 2024 12:21
@kszucs
Copy link
Member

kszucs commented Feb 2, 2024

@chelsea-lin I rebased your branch and retargeted the PR against main.

@webmiche
Copy link
Contributor

webmiche commented Feb 2, 2024

@webmiche Thoughts on dropping the type evolution here? It seems to be causing an issue for BigQuery which has much simpler promotion rules than DB2.

For context, I have not been working a lot with databases and ibis lately given that my project finished.

I think this touches the exact core discussion we had back in the initial PR: type inference rules are backend-specific whereas ibis tries to build data structures that are as generic as possible. And I guess what we have here is exactly that, a backend that does not follow the same type inference rules.

I think there is a range of possibilities of what could be done for this case. From the top of my head:

  • inferring type rules according to the backend
  • define ibis specific type rules and add clue code for backends
  • make decimals "opaque" to some extent, i.e. make them not have a precision and scale at all (at least in some cases where this info is not directly encoded by the frontend)

I personally advise against inferring type inference rules from backend choice as this somewhat defeats the purpose of ibis in the first place. Also, the ibis specific type rules sound like just another specification on top of all the others, so that's no real solution.

I think the third choice is probably most reasonable. Honestly, I don't know exactly how the ibis data structures work currently, so I can't tell the ramifications of such a change. Maybe deleting the type inference/binop promotion-functionality already achieves this to some extent. Maybe, this change is only the first towards handling this in a principled way.

So, all in all, I am fine with dropping it. If I was still working on the project I implemented the type inference for originally, I would probably add some clue code to handle this in between the ibis data structures and my data structures. For this, I would definitely appreciate it if ibis communicates to me which decimals are defined by the frontend and which it is not sure about. The "opaque" decimal solution would achieve this, AFAIU.

Maybe @ingomueller-net has more thoughts on this.

@chelsea-lin chelsea-lin changed the base branch from main to the-epic-split February 2, 2024 17:29
@cpcloud
Copy link
Member

cpcloud commented Feb 2, 2024

@webmiche Thanks for the response! I was thinking something similar re opaque decimals. We took this approach a long time ago with strings, and it has not proven to be an issue while also allowing us to mostly ignore the various CHAR(N)/VARCHAR(N) string types.

I think we'd have to implement this and try it out on a few backends to see what its blast radius would be. I suspect there might be some issues at the Ibis <-> pyarrow boundary.

@chelsea-lin chelsea-lin changed the base branch from the-epic-split to main February 2, 2024 17:46
@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 2, 2024

@chelsea-lin I rebased your branch and retargeted the PR against main.

@kszucs I also need the commit in the-epic-split branch. What's the process to do that?

@kszucs
Copy link
Member

kszucs commented Feb 5, 2024

We keep the-spic-split rebased on top of the main branch, so this change will reach both branches.

@chelsea-lin chelsea-lin changed the title fix: remove the decimal precision promotion logic from ibis.expr fix: remove the decimal precision promotion logic Feb 5, 2024
@chelsea-lin chelsea-lin force-pushed the chelsealin_numeric branch 2 times, most recently from cec4807 to d20f5de Compare February 5, 2024 20:11
@cpcloud
Copy link
Member

cpcloud commented Feb 5, 2024

Given this isn't breaking anything I'm inclined to merge it. @kszucs Thoughts?

@kszucs kszucs force-pushed the chelsealin_numeric branch 2 times, most recently from 794184d to 5ee75c8 Compare February 5, 2024 23:00
Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @chelsea-lin!

@kszucs kszucs changed the title fix: remove the decimal precision promotion logic refactor(ir): remove the decimal precision promotion logic Feb 5, 2024
@kszucs
Copy link
Member

kszucs commented Feb 5, 2024

Given this isn't breaking anything I'm inclined to merge it. @kszucs Thoughts?

Agree.

@kszucs kszucs enabled auto-merge (rebase) February 5, 2024 23:03
@kszucs kszucs merged commit 0db3ec7 into ibis-project:main Feb 5, 2024
82 checks passed
@ingomueller-net
Copy link

Thanks, everybody, for the work on Ibis and this issue!

What I have asked myself in the past due to this and similar issues is: what is the degree of portability that Ibis aims for? Since the precision promotion logic is different in some backends, Ibis programs may result in different results when run on different backends. Is that expected? Is there a place where these differences are tracked? When do we consider a backend as "correct" if we allow these kind of differences?

@cpcloud
Copy link
Member

cpcloud commented Feb 6, 2024

@ingomueller-net Good questions!

what is the degree of portability that Ibis aims for?

I don't have a general and precise answer for this.

Here are some aspects of Ibis's output that aren't portable across backends and likely never will be:

  • Aggregations involving floating point arithmetic (any parallelization has the potential to defeat any expectation of exact equality)
  • Decimal precision and scale is not uniformly supported across backends. For example, postgres supports arbitrary precision decimals, while BigQuery supports up to 32 bytes with BIGDECIMAL.

I'm sure there are others.

Is that expected?

Depends. Generally speaking, no, but there's functionality that is out of control that cannot be replicated across backends.

Is there a place where these differences are tracked?

At least I can answer this one definitively: no there is not :)

Perhaps a docs page with specific details here is in order!

When do we consider a backend as "correct" if we allow these kind of differences?

It would be difficult to do without a definition of correctness. I'd like to see if there's a way we can capture the backend differences first, and then see whether such a definition comes out of that.

@ingomueller-net
Copy link

Thanks for the prompt answer, @cpcloud!

  • Aggregations involving floating point arithmetic (any parallelization has the potential to defeat any expectation of exact equality)

That's a good one! And somewhat tricky: one could argue that any possible order is "correct" by accepting any possible ordering (like the SQL standard does, I believe).

[ Off topic: There are actually quite a number of sources of non-determinism. I just now remembered that, in some previous life, I wrote a paper about them as well as a potential solution, which, however, I think no real system ever implemented ;) ]

I'm sure there are others.
[...]

Is there a place where these differences are tracked?

At least I can answer this one definitively: no there is not :)

Perhaps a docs page with specific details here is in order!

When do we consider a backend as "correct" if we allow these kind of differences?

It would be difficult to do without a definition of correctness. I'd like to see if there's a way we can capture the backend differences first, and then see whether such a definition comes out of that.

+1

@ingomueller-net
Copy link

To add to the issue of aggregations, note that the result of any non-associative function depends on the input order, including first, last, array_agg (not sure if it exists), sum on floats, but also sum on integer in most cases (since it may depend on the order whether or not an overflow occurs).

And then the result of most window functions obviously depends on the order.

Again, all of these are non-deterministic in many SQL systems, so you don't get a guarantee to obtain the same result from run to run against the same system instance there either. I, thus, think it'd be perfectly valid to declare all possible orderings as "correct." However, it has to be clear what exactly the semantics of each operation are, possibly referring to non-determinism for some operations.

@ingomueller-net
Copy link

ingomueller-net commented Feb 7, 2024

Oh, and another point that came up when I asked the same question on substrait: The semantics of ORDER BY aren't the same in every system, for example, for NaNs which could come before or after numbers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(bigquery): Numeric precision increase 1 unexpectedly after every sub/add operations
5 participants